Information Extraction from Web Documents Based on Local Unranked Tree Automaton Inference
نویسندگان
چکیده
Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.
منابع مشابه
Rewrite-Based Verification of XML Updates
We propose a model for XML update primitives of the W3C XQuery Update Facility as parameterized rewriting rules of the form: ”insert an unranked tree from a regular tree language L as the first child of a node labeled by a”. For these rules, we give type inference algorithms, considering types defined by several classes of unranked tree automata. These type inference algorithms are directly app...
متن کاملWrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata
A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up the extraction, the la...
متن کاملOn Probability Distributions for Trees: Representations, Inference and Learning
We study probability distributions over free algebras of trees. Probability distributions can be seen as particular (formal power) tree series [BR82; EK03], i.e. mappings from trees to a semiring K. A widely studied class of tree series is the class of rational (or recognizable) tree series which can be defined either in an algebraic way or by means of multiplicity tree automata. We argue that ...
متن کاملLearning (k, l)-Contextual Tree Languages for Information Extraction
Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this s...
متن کاملInformation extraction from structured documents using k-testable tree automaton inference
Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to do this is to induce tree auto...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003